Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-2185. Improve gRPC log messages debugability. #1184

Closed
wants to merge 1 commit into from

Conversation

jojochuang
Copy link
Contributor

What changes were proposed in this pull request?

Log StatusRuntimeException full stack trace.
I think part of the problem we saw in RATIS-2135 as well as some other related issues is that StatusRuntimeException only logs the exception messages. Therefore it "hides" the severity of the issue.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2185

How was this patch tested?

No test. This is a logging only change.

@szetszwo
Copy link
Contributor

@jojochuang , the StatusRuntimeException stack trace often is useless since the StatusRuntimeException is received from the StreamObserver.onError method. The stack trace contains only the gRPC internal code.

@jojochuang
Copy link
Contributor Author

Thanks @szetszwo.
Will the java utility logger to slf4j adapter solve the problem for me (it redirects stderr to role log)? like this one: https://github.com/apache/ozone/pull/7494/files
will it shed more info about an error that way?

@szetszwo
Copy link
Contributor

szetszwo commented Nov 27, 2024

Will the java utility logger to slf4j adapter solve the problem for me (it redirects stderr to role log)? ...

Do you mean that printing the stack trace in stderr?

I really think that the stack trace is not useful since it will contain the gRPC and Netty internal code. The long stack trace also makes it harder to read the log. That's why we changed the code to not printing it in RATIS-230. Get a taste below (copied from https://discuss.daml.com/t/extractor-issue-io-grpc-statusruntimeexception-unavailable-io-exception/517)

15:08:29.970 [default-akka.actor.default-dispatcher-166] WARN  a.s.s.RestartWithBackoffSource - Restarting graph due to failure. stack_trace: 
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.Status.asRuntimeException(Status.java:533)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:442)
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1140)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more

@jojochuang jojochuang closed this Nov 28, 2024
@jojochuang
Copy link
Contributor Author

Ok I'll close the PR.

@szetszwo
Copy link
Contributor

@jojochuang , thanks for your understanding. I can see two options:

  1. Make the stack trace configurable, or
  2. Print only once -- detect if the exception (with the same stack) is repeated and not printing the stack trace again.

What do you think?

@szetszwo
Copy link
Contributor

One more option.

  1. Print also the cause exception (but not the stack trace), e.g. the above exception will be printed as below:
15:08:29.970 [default-akka.actor.default-dispatcher-166] WARN  a.s.s.RestartWithBackoffSource - Restarting graph due to failure. stack_trace: 
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Caused by: java.io.IOException: Connection reset by peer

@szetszwo
Copy link
Contributor

szetszwo commented Nov 29, 2024

@jojochuang , let's do option 3 first since it is easy.

BTW, found that some exceptions are not printed in GrpcAdminProtocolService; see #1186

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants